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Abstract 

We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new 
method of nonparametric regression that accommodates continuous and categorical inputs, and 
responses that can be modeled by a generalized linear model. We prove conditions for the 
asymptotic unbiasedness of the DP-GLM regression mean function estimate. We also give 
examples for when those conditions hold, including models for compactly supported continuous 
distributions and a model with continuous covariates and categorical response. We empirically 
analyze the properties of the DP-GLM and why it provides better results than existing Dirichlet 
process mixture regression models. We evaluate DP-GLM on several data sets, comparing it to 
modern methods of nonparametric regression like CART, Bayesian trees and Gaussian processes. 
Compared to existing techniques, the DP-GLM provides a single model (and corresponding 
inference algorithms) that performs well in many regression settings. 



1 Introduction 

In this paper, we examine the general regression problem. The general regression problem models 
a response variable Y as dependent on a set of covariates x, 

y|x~/(m(x)). (1) 

The function m(x) is called the mean function, which maps the covariates to the conditional mean 
of the response; the distribution / characterizes the deviation of the response from its conditional 
mean. The simplest example of general regression is linear regression, where m(x) is a linear 
function of x and / is a Gaussian distribution with mean m{x) and fixed variance. 

The linear regression methodology is generalized to many types of response variables with 



generalized linear models (GLMs) (McCullagh and Nelder 1989). In their canonical form, a GLM 



assumes that the conditional mean of the response is a linear function of the covariates and that the 
response distribution is in an exponential family. GLMs generalize many classical regression and 
classification methods beyond linear regression, including logistic regression, multinomial regression, 
and Poisson regression. 

A considerable restriction imposed by a GLM is that the covariates must enter the distribution 
of the response through a linear function. (A non-linear function can be applied to the output of the 
linear function, but only one that does not depend on the covariates.) For real world applications 
where the distribution of the response depends on the covariates in a non- linear way, this assumption 
is limiting. Flexibly fitting non-linear response functions is the problem of nonparametric regression. 

Our goal in this paper is to develop a general-purpose method for nonparametric regression. We 
develop an algorithm that can capture arbitarily shaped response functions, model diverse response 
types and covariate types, accommodate high dimensional covariates, and capture heteroscedastic- 
ity, i.e., the property of the response distribution where both its mean and variance change with 
the covariates. 
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Our idea is to model m{x) by a mixture of simpler "local" response distributions fi(mi{x)), 
each one applicable in a region of the covariates that exhibits similar response patterns. To handle 
multiple types of responses, each local regression is a GLM. Notice this means that each mi{x) 
is a linear function — the desired non-linear mean function arises when we marginalize out the un- 
certainty about which local response distribution is in play. (See Figure [l] for a simple example 
with one covariate and a continuous response function.) Furthermore, our method captures het- 
eroscedasticity. Each GLM fi can vary in a way beyond the variability that arises from a single 
linear function of the covariates. 

Finally, we take a Bayesian nonparametric approach to determining the number of local re- 
gressions needed to explain, and form predictions about, a particular data set. With a Bayesian 
nonparametric mixture model, we let the data determine both the number and form of simple 
mean functions that are mixed. This is critical for the objective of modeling arbitrary response 
distributions: complex response functions can be constructed with many local functions, while 
simple response functions need only a small number. Unlike frequentist nonparametric regression 
methods, e.g., those that create a mean function for each data point, the Bayesian nonparametric 
approach is biased to using only as complex a model as the data allow. 

Thus, we develop Dirichlet process mixtures of generalized linear models (DP-GLMs), a Bayesian 
nonparametric regression model that combines the advantages of generalized linear models with the 
flexibility of nonparametric regression. DP-GLMs are a generalization of several existing DP-based, 
covariate/response specific regression models (Miiller et al. 1996 Shahbaba and Neal 2009) to a 
variety of response distributions. We derive Gibbs sampling algorithms for fitting and predicting 
with DP-GLMs. We investigate some of the statistical properties of these models, such as the form 
of their posterior and conditions for the asymptotic unbiasedness of their predictions. We study 
DP-GLMs with several types of data. 

In addition to defining and discussing the DP-GLM, a central contribution of this paper is 
our theoretical analysis of its response estimator and, specifically, the asymptotic unbiasedness of 
its predictions. Asymptotic properties help justify the use of certain regression models, but they 
have largely been ignored for regression models with Dirichlet process priors. We will give general 
conditions for asymptotic unbiasedness, and examples of when they are satisfied. (These conditions 
are model-dependent, and can be difficult to check.) 

The rest of this paper is organized as follows. In Section [2| we review the current research 
literature on Bayesian nonparametric regression and highlight how the DP-GLM extends this field. 
In Section [Sj we review Dirichlet process mixture models and generalized linear models. In Section 
|4j we construct the DP-GLM and derive algorithms for posterior computation. In Section [5] we 
give general conditions for unbiasedness and prove it in a specific case with conjugate priors. In 
Section [6] we compare DP-GLM and existing methods on three data sets. We illustrate that the 
DP-GLM provides a powerful nonparametric regression model that can accommodate many data 
analysis settings. 



2 Related work 



Gaussian process (GP), Bayesian regression trees and Dirichlet process mixtures are the most com- 
mon prior choices for Bayesian nonparametric regression. GP priors assume that the observations 



arise from a Gaussian process model with known covariance function form (see Rasmussen and 



Williams (2006) for a review). Without modification, however, the GP model is only applicable to 



problems with continuous covariates and constant variance. The assumption of constant covariance 
can be eased by using Dirichlet process mixtures of GPs (Rasmussen and Ghahramani 2002) or 



2 



treed GPs (Gramacy and Lee 2008). Bayesian regression trees place a prior over the size of the 



tree and can be viewed as an automatic bandwidth selection method for classification and regres- 
sion trees (CART) (Chipman et al. 1998). Bayesian trees have been expanded to include linear 



models ( [Chipman et al. 2002) and GPs (Gramacy and Lee 2008) in the leaf nodes. 

In a regression setting, the Dirichlet process has been mainly used for problems with a continuous 



response. West et al. (1994), Escobar and West (1995) and Miiller et al. (1996) used joint Gaussian 
mixtures for the covariates and response, and Rodriguez et al. (2009) generalized this method using 



dependent DPs for multiple response functionals. However, the method of Rodriguez et al. (2009) 



can be slow if a fully populated covariance matrix is used, and is potentially inaccurate if it is 
assumed diagonal. To avoid these issues — which amount to over-fitting the covariate distribution 
and under-fitting the response — some researchers have developed methods that use local weights 
on the covariates to produce local response DPs. This has been achieved with kernels and basis 



functions (Griffin and Steel 



spatial-based weights (Griffin and Steel 



2007 



2006 



2007 



Dunson et al. 2007), GPs (Gelfand et al. 2005) and general 



Duan et al. 



2007). Still other methods, again 



based on dependent DPs, capture similarities between clusters, covariates or groups of outcomes. 



including in non-continuous settings (De lorio et al. 2004, Rodriguez et al. 2009). The method 



presented here is equally applicable to the continuous response setting and tries to balance its fit 
of the covariate and response distributions by introducing local GLMs — the clustering structure is 
based on both the covariates and how the response varies with them. 

There is somewhat less research that develops Bayesian nonparametric models for other types 
of response. Mukhopadhyay and Gelfand (1997) and Ibrahim and Kleinman (1998) used a DP 



prior for the random effects portion of a GLM. Likewise, Amewou-Atisso et al. (2003) used a DP 



prior to model arbitrary symmetric error distributions in a semi-parametric linear regression model. 
While these are powerful extensions of regression models, they still maintain the assumption that 



all covariates enter the model linearly in the same way. Our work is closest to Shahbaba and 



Neal (2009). They proposed a model that mixes over both the covariates and response, where the 



response is drawn from a multinomial logistic model. The DP-GLM studied here is a generalization 
of their idea. 

Finally, asymptotic properties of Dirichlet process regression models have not been well studied. 
Most current literature centers around consistency of the posterior density for DP Gaussian mixture 



models ( [Barron et al.|1999|[Ghosal et al.|1999HGhosh and Ramamoorthi|2003l |Walker|2004l [Tokdar 



2006) and semi-parametric linear regression models (Amewou-Atisso et al. 2003 Tokdar 



Only recently have the posterior properties of DP regression estimators been studied. Rodriguez 



2006[ ). 



et al. (2009) showed point-wise asymptotic unbiasedness for their model, which uses a dependent 



Dirichlet process prior, assuming continuous covariates under different treatments with a continuous 
responses and a conjugate base measure (normal-inverse Wishart). In Section [s] we show pointwise 
asymptotic unbiasedness of the DP-GLM in both the continuous and categorical response settings. 



In the continuous response setting, our results generalize those of Rodriguez et al. (2009) and 



Rodriguez (2007). Moreover in the categorical response setting, the same theoretical framework 



provides the same consistency results for the classification model of Shahbaba and Neal ( 2009 ) . 



3 Mathematical background 

In this section we provide some mathematical background. We review Dirichlet process mixture 
models and generalized linear models. 
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Dirichlet Process Mixture Models. The Dirichlet process (DP) is a distribution over distri- 
butions (Ferguson 1 |1973D . It is denoted, 



G~DP(qGo), (2) 

where G is a random distribution. There are two parameters. The base distribution Go is a 
distribution over the same space as G, e.g., if we want G to be a distribution on reals then Go must 
be a distribution on reals too. The concentration parameter a is a positive scalar. An important 
property of the DP is that random distributions G are discrete, and each places its mass on a 
countably infinite collection of atoms drawn from Gq. 
Consider the model 

G ~ DP(a,Go) (3) 
~ G (4) 

The joint distribution of n replicates of 9i is 

p{e,.,n I a, Go) = I (fl ^(^*)) PiG)dG (5) 
One write this joint in a simpler form. Specifically, the conditional distribution of On given Oi-^n-i) 



follows a Polya urn distribution (Blackwell and MacQueen 1973), 



n-l 

i 

^n|fl:(n-l) 



^ n—1 

1-^^^^^+ ^" I '^o- (6) 

n—1 ^-^ a + n — 1 



a + - - . 

1=1 

With the chain rule, this specifies the full joint distribution of 9i:n- 

Equation Q reveals the clustering property of the joint distribution of There is a positive 
probability that each 6i will take on the value of another 9j, leading some of the draws to share 
values. This equation also makes clear the roles of scaling parameter a and base distribution 
Gq. The unique values contained in 9i:n are drawn independently from Go and the parameter a 
determines how likely ^n+i is to be a newly drawn value from Gq rather than take on one of the 
values from Oi-n- The base measure Gq controls the distribution of a newly drawn valuej^ 

In a DP mixture, is a latent parameter to an observed data point x (Antoniak 1974[ ), 



P ~ DP(aGo), 
xi\9i^ fi-\9,). 

Examining the posterior distribution of 9i:n given brings out its interpretation as an "infinite 
clustering" model. Because of the clustering property, observations are grouped by their shared 
parameters. Unlike finite clustering models, however, the number of groups is random and unknown. 
Moreover, a new data point can be assigned to a new cluster that was not previously seen in the 
data. 



^Technically, if Go is itself discrete then the "unique" values can themselves share values. 
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Generalized Linear Models. Generalized linear models (GLMs) build on linear regression to 
provide a flexible suite of predictive models. GLMs relate a linear model to a response via a 
link function; examples include familiar models like logistic regression, Poisson regression, and 



multinomial regression. See McCullagh and Nelder (1989) for a full discussion. 



GLMs have three components: the conditional probability model for response Y, the linear 
predictor and the link function. The probability model for Y, dependent on covariates X, is 

f(.y\v) = exp ( .'^^^^ + c{y, (p) 



a{<P) 

Here the canonical form of the exponential family is given, where o, 6, and c are known functions 
specific to the exponential family, (p is an arbitrary scale (dispersion) parameter, and r] is the 
canonical parameter. A linear predictor, Xf3, is used to determine the canonical parameter through 
a set of transformations. It can be shown that 6'(r/) = /U = E[y|X] ( Brown|[l986 ). However, we can 
choose a link function g such that fi = g~^(Xl3), which defines 77 in terms of X/3. The canonical 
form is useful for discussion of GLM properties, but we use the form parameterized by mean ^ 
in the rest of this paper. GLMs are simple and flexible — they are an attractive choice for a local 
approximation of a global response function. 



4 Dirichlet process mixtures of generalized linear models 

We now turn to Dirichlet process mixtures of generalized linear models (DP-GLMs), a Bayesian 
predictive model that places prior mass on a large class of response densities. Given a data set of 
covariate-response pairs, we describe Gibbs sampling algorithms for approximate posterior inference 
and prediction. Theoretical properties of the DP-GLM are developed in Section [5} 

4.1 Model formulation 

In a DP-GLM, we assume that the covariates X are modeled by a mixture of exponential-family 
distributions, the response Y is modeled by a GLM conditioned on the inputs, and that these 
models are connected by associating a set of GLM coefficients with each exponential family mixture 
component. Let 9 = {9x,0y) denote the bundle of parameters over X and I'lX, and let Go 
denote a base measure on the space of both. For example, 6x might be a set of d-dimensional 
multivariate Gaussian location and scale parameters for a vector of continuous covariates; By might 
be a d -|- 2-vector of reals for their corresponding GLM linear prediction coefficients, along with a 
GLM dispersion parameter. The full model is 

P~ Z)P(aGo), 

e = iei,,,0y,i)\p^ p, 

Xi\9i^x ~ fx{-\Bi^x), 

Yi\xi,9i,yr^GLM{-\Xi,9^,y). 

The density fx describes the covariate distribution; the GLM for y depends on the form of the 
response (continuous, count, category, or others) and how the response relates to the covariates 
(i.e., the link function). 

The Dirichlet process clusters the covariate-response pairs {x,y). When both are observed, 
i.e., in "training," the posterior distribution of this model will cluster data points according to 
near-by covariates that exhibit the same kind of relationship to their response. When the response 
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Figure 1: The top figure shows the training data (gray) fitted into clusters, with the prediction given 
a single sample from the posterior, 9^^^ (red). The bottom figure shows the smoothed regression 
estimate (black) for the Gaussian model of Equation ([T]) with the testing data (blue). Data plot 



multipole moments against power spectrum Ci for cosmic microwave background radiation ( Bennett 
et al.||2003D. 
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is not observed, its predictive expectation can be understood by clustering the covariates based 
on the training data, and then predicting the response according to the GLM associated with the 
covariates' cluster. The DP prior acts as a kernel for the covariates; instead of being a Euclidean 
metric, the DP measures the distance between two points by the probability that the hidden 
parameter is shared. See Figure [T] for a demonstration of the DP-GLM. 

We now give a few examples of the DP-GLM that will be used throughout this paper. 



Example: Gaussian Model. We now give an example of the DP-GLM for continuous covari- 
ates/response that will be used throughout the rest of the paper. For continuous covariates/response 
in M, we model locally with a Gaussian distribution for the covariates and a linear regression model 
for the response. The covariates have mean |Uj j and variance afj for the j^^ dimension of the i*'* ob- 
servation; the covariance matrix is diagonal for simplicity. The GLM parameters are the linear pre- 
dictor /3j,o, • • • , /3i,d and the response variance af y. Here, 6x,i = {fJ-i,i:d, and 9y^i = (A,0:d, cr. 
This produces a mixture of multivariate Gaussians. The full model is, 

P ~ DP{aGo), 

Xij\ei^x ~ N {fiij,crfj) , j = l,...,d, 



(7) 



Example: Multinomial Model (Shahbaba and Neal 2009). This model was proposed 



by Shahbaba and Neal (2009) for nonlinear classification, using a Gaussian mixture to model 



continuous covariates and a multinomial logistic model for a categorical response with K categories. 
The covariates have mean ^jj- and variance af^ for the j*'* dimension of the i*^ observation; the 
covariance matrix is diagonal for simplicity. The GLM parameters are the K linear predictor 
A,o,fc, • • • , I3i,d,k, k = 1,. . . ,K. The full model is. 



F{Y, = k\X^,9, 



P ~ DP{aGo), 

ei\p ~ p, 

exp (^Pifi,k + Ej=i l^iJ,kX- 



(8) 



j = l,...,d, 
k = l,...,K. 



Example: Poisson Model with Categorical Covariates. The categorical covariates are mod- 
eled by a mixture of multinomial distributions and the count response by a Poisson distribution. 
If covariate j has K categories, let {pij,i, ■ ■ ■ ,Pi,j,K) be the probabilities for categories 1, . . . , K. 
The covariates are then turned into indicator variables, l^Xij=k}j which are used with the linear 
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predictor, /3j, 0, (3i^i^i-K, ■ ■ ■ , lii,d,i:K- The full model is, 



P~L»P(aGo), (9) 
P(Xij = k\ei^x) = Pi,j,k, j = 1, • • ■ , k = l,...,K, 

/ d K \ 

Xi\Xi, Oi^y = exp I Pifi + ^ ^ A,j,fcl{Xi,^=fc} 1 ) 
\ j=ik=i ' J 

F{Yi = k\Xi,ei,y) = ^^, k = 0,1,2,.... 

We apply Model ^ to data in Section |6] 
4.2 Heteroscedasticity and overdispersion 

One advantage of the DP-GLM is that it provides a strategy for handling common problems in 
predictive modeling. Many models, such as GLMs and Gaussian processes, make assumptions 
about data dispersion and homoscedasticity. Over-dispersion occurs in single parameter GLMs 
when the data variance is larger than the variance predicted by the model mean. [Mukhopadhyay 



and Gelfand (1997) have successfully used DP mixtures over GLM intercept parameters to create 
classes of models that include over-dispersion. The DP-GLM retains this property, but is not 
limited to linearity in the covariates. 

Homoscedasticity refers to the property of variance that is constant among all covariate regions; 
heteroscedasticity is variance that changes with the covariates. Models like GLMs and Gaussian pro- 
cesses assume homoscedasticity and can give poor fits when that assumption is violated. However, 
the DP-GLM can naturally accommodate heteroscedasticity when multiparameter GLMs are used, 
such as linear, gamma and negative binomial regression models. The mixture model setting allows 
the variance parameter to vary between clusters, creating smoothly transitioning heteroscedastic 
posterior response distributions. 

A demonstration of this property is shown in Figure [2j where the DP-GLM is compared against 
a homoscedastic model, Gaussian processes, and heteroscedastic modifications of homoscedastic 
models, treed Gaussian processes and treed linear models. The DP-GLM is robust to heteroscedas- 
tic data — it provides a smooth mean function estimate, while the other models are not as robust 
or provide non-smooth estimates. 



4.3 Posterior prediction with a DP-GLM 

The DP-GLM is used in prediction problems. Given a collection of covariate-response pairs D = 
{Xi, our goal is to compute the expected response for a new set of covariates x, ¥,[Y \x,D]. 

We give the step-by-step process for formulating the model and forming the prediction. 



Choosing the mixture component and GLM. We begin by choosing fx and the GLM. The 
Dirichlet process mixture model and GLM provide flexibility both in terms of the covariates and 
the response. Dirichlet process mixture models allow nearly any type of variable to be modeled 
within the covariate mixture and subsequently transformed for use as a covariate in the GLM. 

Note that certain mixture distributions simply support certain types of covariates but may not 
necessarily be a good fit. For example, one covariate might be strictly positive and continuous. 
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Figure 2: Modeling heteroscedasticity with the DP-GLM and other Bayesian nonparametric meth- 
ods. The estimated mean function is given along with a 90% predicted confidence interval for 
the estimated underlying distribution. DP-GLM produces a smooth mean function and confidence 
interval. 
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This could be modeled with an exponential mixture, 

Xj|Ai ~ Exp{\i). 

However, although exponential mixtures have support on M+-|_, they are mixtures of a single pa- 
rameter exponential family. This has implications for the variance of the distribution, which is 
determined by the mean of each component. Therefore, a mixture of gamma distributions or pos- 
sibly even Gaussians would be a better fit. Both the gamma and Gaussian distributions have a 
mean and dispersion parameter, which free the mixture variance from the mean. 
The GLM — for the conditional response — is chosen in much the same way. 



Choosing the base measure and other hyperparameters. The choice of the base measure 
Go affects how expressive the DP-GLM is, the computational efficiency of the prediction and 
whether some theoretical properties, such as asymptotic unbiasedness, hold. For example. Go 
for the Gaussian model is a distribution over {/J-i, (Ji, Pi^-d, (Ji^y). A conjugate base measure is 
normal-inverse-gamma for each covariate dimension and multivariate normal inverse-gamma for 
the response parameters. This Go allows all continuous, integrable distributions to be supported, 
retains theoretical properties, such as asymptotic unbiasedness, and yields highly efficient posterior 
sampling by allowing the Gibbs sampler to be collapsed (Neal 2000). However, this base measure 
is not expressive for small amounts of data. The tails quickly decline and the mean is tied to the 
variance. In summary, the base measure is often chosen in accordance to data size, distribution 
type, distribution features (heterogeneity, etc) and computational constraints. 

Hyperparameters for the DP-GLM include the DP scaling parameter a and hyperparameters 
parameters for the base measure Go- It is often useful to place a gamma prior on a (Escobar and 
West 1995), while the parameters for Go may have their own prior as well 



Each level of priors 



reduces their influence but adds computational complexity (Escobar and West 1995). 



Approximating the posterior and forming predictions. Our ultimate goal is to form a 
conditional expectation of the response, given a new set of covariates x and the observed data 
D, E[y \X = x,D]. Following the Bayesian regression methodology, we use iterated expectation, 
conditioning on the latent variables, 

E[Y\X = x,D]=E[E[Y\X = x, Oi-.n] \ D] . (10) 

The inner expectation is straight-forward to compute. Conditional on the latent parameters 6i-n 
that generated the observed data, the expectation of the response is 

^rvDT- . _ a Ir^[Y\X = X, 9] Ux\e)Go{d9) + ELi ^ [^1^ = G^] f^m) 

^[Y\A-x,u^..ni- af^Ux\e)Go{de) + Y:=J.m) ■ 

Since Y is assumed to be a GLM, the quantity E = x, 6] is analytically available as a function 
of X and 9. 

The outer expectation of Equation ( 10 ) is generally intractable. We approximate it by Monte 
Carlo integration using M posterior samples of 9i-n, 



M 

E[Y\X = x,D]^-^e[y\X = x, ^S] . (12) 

m=l 



The observations ^^l^'* are i.i.d. from the posterior distribution of 9i:n \ D. 
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We use Markov chain Monte Carlo (MCMC) to obtain M i.i.d. samples from this distribution. 
Specifically, we use Gibbs sampling, which is an effective algorithm for DP mixture models. (See 
Escobar| ( |1994[ ), |MacEachern| (|1994[ ), [Escobar and West] ( |1995| ) and [MacEachern and Miiller| p^98| ) 
for foundational work; Neal (2000) provides a modern treatment and state of the art algorithms.) 
In short, we construct a Markov chain on the hidden variables 9i-n such that its limiting distribution 
is the posterior of interest. Details for its implementation are given in Appendix | A- 1[ 



4.4 Comparison to the Dirichlet process mixture model regression 

The DP-GLM directly models Y as being conditioned on X. Modeling the joint distribution of 
{x,y) as coming from a common mixture component in a classical DP mixture (see Section [s]) also 
induces a conditional distribution of Y given X. In this Section, we conceptually compare these 
two approaches. (They are compared empirically in this section and Section [6|) 
A generic Dirichlet process mixture model (DPMM) has the form, 



P~ L>P(aGo), 



(13) 



DPMMs with this form have been studied for regression (Escobar and West 1995), but have gen- 



erally not been used in practice due to poor results (with a diagonal covariance matrix) or com- 
putational difficulties (with a full covariance matrix). We focus on the former case, with diagonal 
covariance, to study why it has poor results and how the DP-GLM improves on these with a min- 



imal increase in computational difficulty. The difference between Model (13) and the DP-GLM is 



that the distribution of Y given 9 is conditionally independent of the covariates X. This is a small 
difference that has large consequences for the posterior distribution and predictive results. 



Consider the log- likelihood of the posterior of the DPMM given in Model (13). Assume that fy 
is a single parameter exponential, where 9y 



/3, 



K 



i=l 



(14) 



The log-likelihood of the DP-GLM posterior for a single parameter exponential family GLM, where 
(^y = (/3o) ■ • • ) /3d), has the form, 



K 



Im I 



i=l 



E 

j=0 



E 

cec,; 



0Ci 



+ E' 



D) 



(15) 



As the number of covariates grows, the likelihood associated with the covariates grows in both 
equations. However, the likelihood associated with the response also grows with the extra response 



parameters in Equation (15), whereas it is fixed in Equation (14). 



These posterior differences lead to two predictive differences: 1) the DP-GLM is much more 
resistant to dimensionality than the DPMM, and 2) as the dimensionality grows, the DP-GLM 
produces less stable predictions than the DPMM. Since the number of response related parameters 
grows with the number of covariate dimensions in the DP-GLM, the relative posterior weight of 
the response does not shrink as quickly in the DP-GLM as it does in the DPMM. This keeps the 
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response variable relatively important in the selection of the mixture components and hence makes 
the DP-GLM a better predictor than the DPMM as the number of dimensions grows. However, 
each additional f3j adds some noise, so when d is large the DP-GLM estimate tends to be noisy. 

While the additional GLM parameters help maintain the relevance of the response, they also 
add noise to the prediction. This can be seen in Figure [3j The GLM parameters in this figure have 
a Gaussian base measure, effectively creating a local ridge regression In lower dimensions, the 
DP-GLM produced more stable results than the DPMM because a smaller number of larger clusters 
were required to fit the data well. The DPMM, however, consistently produced stable results in 
higher dimensions as the response became more of a sample average than a local average. The 
DPMM has the potential to predict well if changes in the mean function coincide with underlying 
local modes of the covariate density. However, the DP-GLM forces the covariates into clusters that 
coincide more with the response variable due to the inclusion of the slope parameters. 

We now discuss the theoretical properties of the DP-GLM. 



5 Asymptotic Unbiasedness of the DP-GLM Regression Model 

A desirable property of any estimator is that it should be unbiased, particularly in the limit. 



Diaconis and Preedman ( 1986 ) gives an example of a location model with a Dirichlet process prior 
where the estimated location can be bounded away from the true location, even when the number 
of observations approaches infinity. We want to assure that DP-GLM does not end up in a similar 
position. 

Notation for this section is more complicated than the notation for the model. Let fo{x,y) be 
the true joint distribution of (x,y); in this case, we will assume that /q is a density. Let J-' be the 
set of all density functions over (x, y). Let H-^ be the prior over induced by the DP-GLM model. 
Let IE/o[-] denote the expectation under the true distribution and IEn/['] be the expectation under 
the prior 

In general, an estimator is a function of observations. Assuming a true distribution of those 
observations, an estimator is called unbiased if its expectation under that distribution is equal to 
the value that it estimates. In the case of DP-GLM, that would mean for every x & A and every 
n > 0, 

Ef,[En[Y\x,{X„Yi)U]] =Ef,[Y\x], 

where A is some fixed domain. En is the expectation with respect to the prior 11 and Ej^ is the 
expectation with respect to the true distribution. 



Since we use Bayesian priors in DP-GLM, we will have bias in almost all cases (Gelman et al. 



2004). The best we can hope for is asymptotic unbiasedness, where as the number of observations 
grows to infinity, the mean function estimate converges to the true mean function. That is, for 
every x G A, 

En[Y\x,{Xi,Yi)'^^^]^E[Y\x] as n ^ oo. 



Diaconis and Freedman ( 1986[ ) give an example for a location problem with a DP prior where the 



posterior estimate was not asymptotically unbiased. Extending that example, it follows that esti- 
mators with DP priors do not automatically receive asymptotic unbiasedness. We use consistency 
and uniform integrability to give conditions for asymptotic unbiasedness. 

Consistency, the notion that as the number of observations goes to infinity the posterior distri- 
bution accumulates in neighborhoods arbitrarily "close" to the true distribution, is tightly related 



■^In unpublished results, we have also tried other sparsity-inducing base measures, such as a Laplacian distribution 
with an LI penalty. They produced less stable results than the Gaussian base measure, likely due to the sample size 
differences between the clusters. 
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Comparison over Dimensions 
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Figure 3: A plain Dirichlet process mixture model regression (left) versus DP-GLM, plotted against 
the number of spurious dimensions (vertical plots). The estimated mean function is given along 
with a 90% predicted confidence interval for the estimated underlying distribution. Data have 
one predictive covariate and a varying number of spurious covariates. The covariate data were 
generated by a mixture model. DP-GLM produces a smoother mean function and is much more 
resistant to spurious dimensionality. 
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to both asymptotic unbiasedness and mean function estimate existence. Weak consistency as- 
sures that the posterior distribution accumulates in regions of densities where "properly behaved" 
functions (i.e., bounded and continuous) integrated with respect to the densities in the region are 
arbitrarily close to the integral with respect to the true density. The expectation may not be 
bounded; in addition to weak consistency, uniform integrability is needed to guarantee that the 
posterior expectation converges to the true expectation, giving asymptotic unbiasedness. Uniform 
integrability also ensures that the posterior expectation almost surely exists with every additional 
observation. Therefore we need to show weak consistency and uniform integrability. 



5.1 Asymptotic Unbiasedness 



We approach asymptotic unbiasedness by showing weak consistency for the posterior of the joint dis- 
tribution and then using uniform integrability to show that the conditional expectation E^/ [y | X = 
X, (Xi,^)"=i] exists for every n and converges to the true expectation almost surely. The proof for 
this is a bit involved, so it has been placed in the Appendix. 

Theorem 5.1. Let x be in a compact set C and II-^ be a prior on T . If, 

(i) for every (5 > 0, II-^ puts positive measure on 

f: [ fo{x,y)log^^^j^dxdy<6 
J f{x,y) 

(a) J \y\fo{y\x)dy < cxd for every x £ C, and 
(Hi) there exists an e > such that for every x £ C, 



\'^'fy{y\x, 



o{d9) < oo. 



then for every n > 0, En[y|x, {Xi, Yi)i;n] exists and has the limit Ejg[y|X = x], almost surely with 
respect to the observation product measure, ^fg°- 



The conditions of Theorem 



5.1 



must be checked for the problem (/o) and prior (n-^) pair, and 
can be difficult to show. Condition (i) assures weak consistency of the posterior, condition (m) guar- 
antees a mean function exists in the limit and condition {Hi) guarantees that positive probability 
is only placed on densities that yield a finite mean function estimate, which is used for uniform 
integrability. Condition (i) is quite hard to show, but there has been recent work demonstrating 



weak consistency for a number of Dirichlet process mixture models and priors dGhosal et aljpQ^ 



for a proof of Theorem 5.1 



Ghosh and Ramamoorthi 


2003 


Amewou-Atisso et al. 


2003 


Tokdar 


2006 


). See the Appendix 


A-2 



5.2 Asymptotic Unbiasedness Example: Gaussian Model 

Theorem |5.2| gives conditions for when Theorem |5.1| holds for the Gaussian model. We will then 



given some examples of when Theorem 5.2 holds. 
Theorem 5.2. Suppose that: 
(i) fo{x,y) is absolutely continuous with respect to M.'^^'^ and has compact support, 
(a) all location-scale-slope parameter mixtures are in the weak support of Go, and 
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(Hi) Gq satisfies assumption {Hi) of Theorem 5.1 
Then, the conclusions of Theorem\5. 1\ hold. 



See the Appendix A-3 for the proof. Examples that satisfy Theorem 5.2 are as follows, with 
technical results in the Appendix. 

Normal-Inverse- Wishart. Note that in the Gaussian case, slope parameters can be generated 
by a full covariance matrix: using a conjugate prior, a Normal-Inverse- Wishart, will produce an 



instance of the DP-GLM. Define the following model, which was used by Miiller et al. (1996), 



P ~ DPi 

{Xi,Y,)\9,^N{^l,J:) 



(16) 



The last line of Model ( 16 ) can be broken down in the following manner, 



where 



b 



l:d 



We can then define /3 as, 

/3o = fly- b^T.-^iJx, 

The base measure Go is defined as, 

(^, S) ~ Normal Inverse Wishart{\, v, a, B). 

Here A is a mean vector, z/ is a scaling parameter for the mean, a is a scaling parameter for the 
covariance, and is a covariance matrix. 



Diagonal Normal-Inverse-Gamma. It is often more computationally efficient to specify that 
Tlx is a diagonal matrix. In this case, we can specify a conjugate base measure component by 
component: 



fjjj ~ Inverse Gamma{aj,bj), 
(^i.y ~ Inverse Gamma{ay,by), 

\<yi,y ~ ^d+li^y,f^y/^y)- 



j = l,...,d, 
j = l,...,d, 



The Gibbs sampler can still be collapsed, but the computational cost is much lower than the full 
Normal-Inverse- Wishart . 
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Normal Mean, Log Normal Variance. Conjugate base measures tie the mean to the variance 
and can be a poor fit for small, heteroscedastic data sets. The following base measure was proposed 



Shahbaba and Neal (2009), 



log(o-ij) ~ N{mj^^,sl^), 
5.3 Asymptotic Unbiasedness Example: Multinomial Model 



j = y, 


1,.. 


.,d, 


j = 


1,.. 


.,d, 


j = 


0,.. 


.,d. 



Now consider the multinomial model of [Shahbaba and Neal (2009), given in Model ([8|), 

P ~ DP{aGo), 



HYi = k\X, 



exp (A,o,fc + Ej=i l^i,j,kXi,j 



E£i exp (/3i,o/ + Ej=i 



l,...,d, 
1,...,K 



Theorem 5.3 gives conditions for when Theorem 1 5 . 1 1 holds for the Multinomial model. We will then 



given some examples of when Theorem 5.3 holds. 

Theorem 5.3. Suppose that: 
(i) fo{x) is absolutely continuous with respect to M*^ and has compact support, 
(a) ¥{Y = k\x) is continuous in x for every x ^ C and k = 1, . . . ,K, 
(Hi) all location- scale- slope parameter mixtures are in the weak support o/Gq, and 



(iv) Go satisfies assumption (iii) of Theorem 5.1 
Then, the conclusions of Theorem\5. 1\ hold. 



See Appendix A-4 for the proof. We now give some examples of base measures that satisfy 
Theorem |5.3[ In general, the base measure for the continuous covariates is the same as those in the 
Gaussian model, while the GLM parameters are given a Gaussian base measure. Technical results 



are discussed in the Appendix A-4 



Normal-Inverse-Wishart. The covariates have a Normal-Inverse- Wishart base measure while 
the GLM parameters have a Gaussian base measure, 

{fj-i^x,'^i,x) ~ Normal Inverse Wishart{X, v, a, B), 

l3ij^k N{mj^k,slk), j = 0,...,d, k = l,...,K. 

Diagonal Normal-Inverse-Gamma. It is often more computationally efficient to specify that 
Tlx is a diagonal matrix. Again, we can specify a conjugate base measure component by component 
while keeping the Gaussian base measure on the GLM components. 



l^i,j I 
Pi,j,k I '^ij-y 



Inverse Gamma{aj,bj), 



ai.i ~ N{Xj,aij/uj) 



Ni'mj,k,slk), 



j = l,...,d, 
j = l,...,d, 
j = 0,...,d, k = l,...,K. 
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Normal Mean, Log Normal Variance. Likewise, for heteroscedastic covariates we can use the 



log normal base measure of jShahbaba and Neal (2009), 



log(crij) ~ iV(mj-CT,s|^), 



"hill 
2 



J = 0, . . . , d, k-- 



--l,...,d, 
--l,...,d, 
1 K. 



6 Empirical study 

We compare the performance of DP-GLM regression to other regression methods. We studied data 
sets that illustrate the strengths of the DP-GLM, including robustness with respect to data type, 
heteroscedasticity and higher dimensionality than can be approached with traditional methods. 



Shahbaba and Neal (2009) used a similar model on data with categorical covariates and count 



responses; their numerical results were encouraging. We tested the DP-GLM on the following 
datasets. 

Datasets. We selected three data sets with continuous response variables. They highlight various 
data difficulties within regression, such as error heteroscedasticity, moderate dimensionality (10-12 
covariates), various input types and response types. 



Cosmic Microwave Background (CMB) [Bennett et al. (2003). The data set consists 
of 899 observations which map positive integers i = 1,2,..., 899, called 'multipole moments,' 
to the power spectrum Ci. Both the covariate and response are considered continuous. The 
data pose challenges because they are highly nonlinear and heteroscedastic. Since this data set 
is only two dimensions, it allows us to easily demonstrate how the various methods approach 
estimating a mean function while dealing with non-linearity and heteroscedasticity. 



Concrete Compressive Strength (CCS) Yeh (1998). The data set has eight covariates: 



the components cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate 
and fine aggregate, all measured in kg per m^, and the age of the mixture in days; all 
are continuous. The response is the compressive strength of the resulting concrete, also 
continuous. There are 1,030 observations. The data have relatively little noise. Difficulties 
arise from the moderate dimensionality of the data. 



Solar Flare (Solar) Bradshaw (1989). The response is the number of solar flares in a 24 



hour period in a given area; there are 11 categorical covariates. 7 covariates are binary and 
4 have 3 to 6 classes for a total of 22 categories. The response is the sum of all types of solar 
flares for the area. There are 1,389 observations. Difficulties are created by the moderately 
high dimensionality, categorical covariates and count response. Few regression methods can 
appropriately model this data. 

Dataset testing sizes ranged from very small (20 observations) to moderate sized (800 observations). 
Small dataset sizes were included due to interests in (future) online applications. 



Competitors. The competitors represent a variety of regression methods; some methods are only 
suitable for certain types of regression problems. 

• Ordinary Least Squares (OLS). A parametric method that often provides a reasonable 
fit when there are few observations. Although OLS can be extended for use with any set 
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of basis functions, finding basis functions that span the true function is a difficult task. We 
naively choose [1 Xi . . . X^]^ as basis functions. OLS can be modified to accommodate both 
continuous and categorical inputs, but it requires a continuous response function. 

CART. A nonparametric tree regression method generated by the Matlab function class- 
regtree. It accommodates both continuous and categorical inputs and any type of response. 



Bayesian CART. A tree regression model with a prior over tree size dChipman eFallpQSl ); 
it was implemented in R with the tgp package. 

Bayesian Treed Linear Model. A tree regression model with a prior over tree size and a 



linear model in each of the leaves (Chipman et al. 2002); it was implemented in R with the 
tgp package. 

Gaussian Processes (GP). A nonparametric method that can accommodate only contin- 
uous inputs and continuous responses. GPs were generated in Matlab by the program gpr of 



Rasmussen and Williams (2006). It is suitable only for continuous responses and covariates. 



• Treed Gaussian Processes. A tree regression model with a prior over tree size and a GP 
on each leaf node ( Gramacy and Lee||2008 ); it was implemented in R with the tgp package. 



Basic DP Regression. Similar to DP-GLM, except the response is a function only of /Xy, 
rather than /3o + X] fiiXi. For the Gaussian model, 

P~ Z)P(aGo), 
Ai|6'i ~ N{fii^^,af^), 



This model was explored in Section 4.4 



• Poisson GLM (GLM). A Poisson generalized linear model, used on the Solar Flare data 
set. It is suitable for count responses. 

Cosmic Microwave Background (CMB) Results. For this dataset, we used a Guassian 
model with base measure 

~ Nim^, si), al ~ exp {N{m^^s, ; 

/3o:d ~ N{myfl.,d, 4,0:d)' ~ exp {N{m^^s, ■ 

This prior was chosen because the variance tails are heavier than an inverse gamma and the mean 
is not tied to the variance. It is a good choice for heterogeneous data because of those features. 
Computational details are given in Appendix A-5[ 



All non-linear methods except for CART (DP-GLM, Bayesian CART, treed linear models, 
GPs and treed GPs) did comparably on this dataset; CART had difficulty finding an appropriate 
bandwidth. Linear regression did poorly due to the non-linearity of the dataset. Fits for het- 
eroscedasticity for the DP-GLM, GPs, treed GPs and treed linear models on 250 training data 
points can be seen in Figure [2] See Figure |4] and Table [T] for results. 



18 



CMB Dataset 




5 10 30 50 100 250 



Number of Observations 



Figure 4: The average mean absolute error (top) and mean squared error (bottom) for ordinary 
least squares (OLS), tree regression, Gaussian processes and DP-GLM on the CMB data set. The 
data were normalized. Mean +/— one standard deviation are given for each method. 



Method 


Mean Absolute Error 




Mean Square Error 




Training set size 


30 


50 


100 


250 


500 


30 


50 


100 


250 


500 


DP-GLM 


0.58 


0.51 


0.49 


0.48 


0.45 


1.00 


0.94 


0.91 


0.94 


0.83 


Linear Regression 


0.66 


0.65 


0.63 


0.65 


0.63 


1.08 


1.04 


1.01 


1.04 


0.96 


CART 


0.62 


0.60 


0.60 


0.56 


0.56 


1.45 


1.34 


1.43 


1.29 


1.41 


Bayesian cart 


0.66 


0.64 


0.54 


0.50 


0.47 


1.04 


1.01 


0.93 


0.94 


0.84 


Treed Linear Model 


0.64 


0.52 


0.49 


0.48 


0.46 


1.10 


0.95 


0.93 


0.95 


0.85 


Gaussian Process 


0.55 


0.53 


0.50 


0.51 


0.47 


1.06 


0.97 


0.93 


0.96 


0.85 


Treed GP 


0.52 


0.49 


0.48 


0.48 


0.46 


1.03 


0.95 


0.95 


0.96 


0.89 



Table 1: Mean absolute and square errors for methods on the CMB data set by training data size. 
The best results for each size of training data are in bold. 
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CCS Dataset 
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Gaussian 
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DP 



Figure 5: The average mean absolute error (top) and mean squared error (bottom) for ordinary least 
squares (OLS), tree regression, Gaussian processes, location/scale DP and the DP-GLM Poisson 
model on the CCS data set. The data were normalized. Mean +/— one standard deviation are 
given for each method. 



Concrete Compressive Strength (CCS) Results. The CCS dataset was chosen because of 
its moderately high dimensionality and continuous covariates and response. For this dataset, we 
used a Gaussian model and a conjugate base measure with conditionally independent covariate and 
response parameters, 

{fix,(^x) ~ Normal — Inverse — Gamma{mx, Sx, ax,bx) , 
(/3o:d) (^y) ~ Multivariate Normal — Inverse — Gamma{My, Sy, ay, by). 

This base measure allows the sampler to be fully collapsed but has fewer covariate-associated pa- 
rameters than a full Normal-Inverse- Wishart base measure, giving it a better fit in a moderate 
dimensional setting. In testing, it also provided better results for this dataset than the exponenti- 
ated Normal base measure used for the CMB dataset; this is likely due to the low noise and variance 



of the CCS dataset. Computational details are given in Appendix A- 6 

Results on this dataset were more varied than those for the CMB dataset. GPs had the best 
performance overall; on smaller sets of training data, the DP-GLM outperformed frequentist CART. 
Linear regression, basic DP regression and Bayesian CART all performed comparatively poorly. 
Treed linear models and treed GPs performed very well most of the time, but had convergence 
problems leading to overall higher levels of predictive error. Convergence issues were likely caused 
by the moderate dimensionality (8 covariates) of the dataset. See Figure [s] and Table [2] for results. 
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Method 


Mean Absolute Error 




Mean Square Error 






30 


50 


100 


250 


500 


30 


50 


100 


250 


500 


DP-GLM 


0.54 


0.50 


0.45 


0.42 


0.40 


0.47 


0.41 


0.33 


0.28 


0.27 


Location/Scale DP 


0.66 


0.62 


0.58 


0.56 


0.54 


0.68 


0.59 


0.52 


0.48 


0.45 


Linear Regression 


0.61 


0.56 


0.51 


0.50 


0.50 


0.66 


0.50 


0.43 


0.41 


0.40 


CART 


0.72 


0.62 


0.52 


0.43 


0.34 


0.87 


0.65 


0.46 


0.33 


0.23 


Bayesian cart 


0.78 


0.72 


0.63 


0.55 


0.54 


0.95 


0.80 


0.61 


0.49 


0.46 


Treed Linear Model 


1.08 


0.95 


0.60 


0.35 


1.10 


7.85 


9.56 


4.28 


0.26 


1232 


Gaussian Process 


0.53 


0.52 


0.38 


0.31 


0.26 


0.49 


0.45 


0.26 


0.18 


0.14 


Treed CP 


0.73 


0.40 


0.47 


0.28 


0.22 


1.40 


0.30 


3.40 


0.20 


0.11 



Table 2: Mean absolute and square errors for methods on the CCS data set by training data size. 
The best results for each size of training data are in bold. 



Method 


Mean Absolute Error 




Mean Square Error 






50 


100 


200 


500 


800 


50 


100 


200 


500 


800 


DP-GLM 


0.52 


0.49 


0.48 


0.45 


0.44 


0.84 


0.76 


0.71 


0.69 


0.63 


PoissoN Regression 


0.65 


0.59 


0.54 


0.52 


0.48 


0.87 


0.84 


0.80 


0.73 


0.64 


CART 


0.53 


0.48 


0.50 


0.47 


0.47 


1.13 


0.88 


1.03 


0.88 


0.83 


Bayesian CART 


0.59 


0.52 


0.51 


0.47 


0.45 


0.86 


0.80 


0.78 


0.71 


0.60 



Table 3: Mean absolute and square errors for methods on the Solar data set by training data size. 
The best results for each size of training data are in bold. 

Solar Flare Results. The Solar dataset was chosen to demonstrate the flexibility of DP-GLM. 
Many regression techniques cannot accommodate categorical covariates and most cannot accom- 
modate a count-type response. For this dataset, we used the following DP-GLM, 

P 

Oi\P 
Xij I 6i 

We used a conjugate covariate base measure and a Gaussian base measure for /3, 

{Pj,i,.. ■ ,Pj,Kij)) ~ Dirichlet{aj^i, . . .,aj^K{j)), Pj,k ~ iV(mj,fc, s^^). 

Computational details are given in Appendix |A-7 

The only other methods that can handle this dataset are CART, Bayesian CART and Poisson 
regression. The DP-GLM had the best performance under both error measures (with Bayesian 
CART a close second). The high mean squared error values suggests that frequentist CART overfit 
while the high mean absolute error for Poisson regression suggests that it did not adequately fit 
nonlinearities. See Figure |6] and Table [3] for results. 

Discussion. The DP-GLM is a relatively strong competitor on all of the datasets, but only CART 
and Bayesian CART match its flexibility. It was more stable than most of its Bayesian competitors 
(aside from GPs) on the CCS dataset. Our results suggest that the DP-GLM would be a good 



DP(qGo), 
P, 

(PiJ,!' • • • ^Pi,j,K{j)), 

I d K(j) 

Poisson (3ifl + ^ X] f^iJ>k^{x^,j=k} 
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Figure 6: The average mean absolute error (top) and mean squared error (bottom) for tree re- 
gression, a Poisson GLM (GLM) and DP-GLM on the Solar data set. Mean +/— one standard 
deviation are given for each method. 
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choice for small sample sizes when there is significant prior knowledge; in those cases, it acts as 
an automatic outlier detector and produces a result that is similar to a Bayesian GLM. Results 
from Section |4] suggest that the DP-GLM is not appropriate for problems with high dimensional 
covariates; in those cases, the covariate posterior swamps the response posterior with poor numerical 
results. 



7 Conclusions and Future Work 

We developed the Dirichlet process mixture of generalized linear models (DP-GLM), a flexible 
Bayesian regression technique. We discussed its statistical and empirical properties; we gave con- 
ditions for asymptotic unbiasedness and gave situations in which they hold; finally, we tested the 
DP-GLM on a variety of datasets against state of the art Bayesian competitors. The DP-GLM was 
competitive in most setting and provided stable, conservative estimates, even with extremely small 
sample sizes. 

One concern with the DP-GLM is computational efficiency as implemented. All results were 
generated using MCMC, which does not scale well to large datasets. An alternative implementation 
using variational inference ( Blei and Jordan|2005 ) , possibly online variational inference ( Sato|200T ), 



would greatly increase computational feasibility for large datasets. 

Our empirical analysis of the DP-GLM has implications for regression methods that rely on 
modeling a joint posterior distribution of the covariates and the response. Our experiments suggest 
that the covariate posterior can swamp the response posterior, but careful modeling can mitigate 
the effects for problems with low to moderate dimensionality. A better understanding would allow 
us to know when and how such modeling problems can be avoided. 



Appendix 

A-1 Posterior Inference 

In the Gibbs sampler, the state is the collection of labels (zi, . . . , z„) and parameters {61, ... , 0*^), 
where 6* is the parameter associated with cluster c and K is the number of unique labels given 
zi-n. In a collapsed Gibbs sampler, all or part of {9\, . . . , 6"^) is eliminated through integration. Let 
z-i = (zi, . . . , Zi-i, Zj+i, . . . , Zn). A basic inference algorithm is given in Algorithm [T] Convergence 



Algorithm 1: Gibbs Sampling Algorithm for the DP-GLM 
Require: Starting state (zi, . . . , Zn), {O^, . . . , 6'|^), convergence criteria. 



1 


repeat 


2 


for i = 1 to n do 


3 


Sample Zi from p{zi D, z-i, Ol.j^). 


4 


end for 


5 


for c = 1 to do 


6 


Sample 9* given {{Xi,Yi) : Zi = c} 


7 


end for 


8 


if Convergence criteria are met then 


9 


Record (zi, . . . , z„) and {6l,...,9*^ 


10 


end if 


11 


until M posterior samples obtained. 
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criteria for our numerical examples are given in Appendix sections A- 5 to A-7 See Gelman et al. 



( 2004 ) for a more complete discussion on convergence criteria. 



We can sample from the distribution p{zi \ D, z-i, 0l.j^) as follows, 

p{zi I D, z-i,dl.j^) oc p{zi I z^i)p{Xi I zi;„, D, 9l.j^)p{Yi \ Xi, zi,n, D, 6**.^). 
The first part of Equation ( |A-1 ) is the Chinese Restaurant Process posterior value, 



(A-1) 



P{Zi I Z-i) 



n-l+a if = i ^ ^' 

n-l+a if ^ ^ ^• 



Here Uz is the number of elements with the label Zj. The second term of Equation (A-1) is the 



same as in other Gibbs sampling algorithms. If possible, the component parameters ^^.^ can be 
integrated out (in the case of conjugate base measures and parameters that pertain strictly to the 
covariates) and p{Xi \ zi-n, D, O^.j^) can be replaced with 



p{x,\zi.,n,D,eiK)p{eiK\zi.,n)d9i 



K- 



The third term of Equation (A-1) is not found in traditional Dirichlet process mixture model 



samplers. In some cases, this term can also be collapsed, such as Gaussian model with a Normal- 
Inverse-Gamma base measure. In that case, 



p{Yi \X^,Zc,Dc 



r((n + l)/2) (^^^^)-i/2 /'_i/2(„„ + 1) log (l + J-{Y- m,, f 
r(n„/2) \ \ n„s„ 



V 



m„ ^ Xiihn, 
rin = riyo + He, 

si = 4 



(4o + 1/2 {rnoV'm^ + Y^^Y, - m^l/-im„)) / ([ny^ + n,)X,VXr 

[IXi], ric is the number of data 



Here, we define Xc = {[^Xj] : zj = Zc}, Yc = {Yj : zj = Zc}, X, 
associated with label Zc and the base measure is define as. 



ay ~ Inverse — Gamma{nyQ, Syo), 
f3\alr^ N{mo,alV). 



A-2 Proof of Theorem 15.11 



The outline for the proof of Theorem 5.1 is as follows: 



1) Show consistency under the set of conditions given in part (i) of Theorem 5.1 



2) Show that consistency and conditions (ii) and (in) imply the existence and convergence of 
the expectation. 

a) Show that weak consistency implies pointwise convergence of the conditional density. 

b) Note that pointwise convergence of the conditional density implies convergence of the 
conditional expectation. 
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The first part of the proof rehes on a theorem by Schwartz (1965). 



Theorem A-1 ( |Schwartz|(l965[ )). LetU-^ be a 
on all neighborhoods 



prior on 



T. Then, ifH-^ places positive probability 



J f{x,y) 
for every 6 > 0, then H-f is weakly consistent at /q. 

Therefore, consistency of the posterior at /o follows directly from condition (i). 
We now show pointwise convergence of the conditional densities. Let fn{x,y) be the Bayes 
estimate of the density under II-^ after n observations, 

Ux, y) = fi^, y)n^ (df I {Xi, Yi)^=i) . 

Posterior consistency of II-^ at /o implies that fn{x,y) converges weakly to fo{x,y), pointwise in 
{x,y)- 

Lemma A- 2. Suppose that li^^ is weakly consistent at fo . Then fn{y\x) converges weakly to fo{y\x) . 

Proof. If II-^ is weakly consistent at /o, then fn{x, y) converges weakly to /o(x, y). We can integrate 
out y to see that fn[x) also converges weakly to fo{x). Then note that 



lim fniy\x) 



y fn{x,y) fo{x,y) . / I X 
hm „ . . = „ . . = fo[y\x). 



fn{x) 



h{x) 



□ 



Now we are ready to prove Theorem |5.1 



Theorem \5.1[ Existence of expectations: by condition [Hi) of Theorem 5.1 

\y\^^^ fn{y\x)dy < oo 



almost surely. This and condition (ii) ensure uniform integrability and that E^/ [F | X = x, (Xi, 1^)"^^^] 



exists for every n almost surely. By Lemma A-2, the conditional density converges weakly to the 
true conditional density; weak convergence and uniform integrability ensure that the expectations 
converge: E^/ [y | X = x, (Xi, Yi)^^-^] converges pointwise in x to Ej^ [y | X = x]. 

□ 



A-3 Proof of Theorem 15.21 and Results for Base Measures 

We sketch the proof of this theorem in two parts. First, we show that we can use a Gaussian 
convolution approximation to /o(x,y), and second that the posterior of the convolution is weakly 
consistent at /q. We do this only for the case where x is a scalar, but it can be easily extended to 
the vector case. 

Uniform equicontinuity will be used throughout this proof and the next. 

Definition A-3. A set of functions T is uniformly equicontinuous if for every e > 0, there exists 
a 6 > such that for every xi, X2 G X, if \ |xi — X2I | < S, then |/(xi) — /(x2)| < e for all f £ T. 
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Theorem A-4. Suppose that fo{x,y) has compact support; that is, there exists some x' and y' 
such that fo{x,y) = whenever {x,y) ^ [—x',x'] x [—y',y']. Set 

h,h{x,y) = j j j foif^,/3o,l3i)4>hA^ - f^)<Phyiy - 1^0- i3ifJ-)dnd(3odf3i, 

where 4>hix) is the Gaussian pdf with variance h, evaluated at x. Then 

. ^i™ n / / -^o^^' -p^^T^dxdy = 0. 
K,hy^Qj J JoMx,y) 



Proof. This proof follows Remark 3 of Ghosal et al. ( |1999 ) closely. Fix < ci < C2. Because /o is 



compactly supported, there exists a' < b' , a" < b" , and a'" < b'" such that for every 

(/.,/3o,/3i)G[a',6']x[a",6"]x[a'",r]=P, 

/o(/U,^o + > ci and 

(^,/3o,/3i) ^P, 

/o(/U, /3o+/3i/u) < C2. We can choose /io,x) ^o,x such that A^(0, /io,x) gives 1/3 probability to (0, b' — a'). 
Define 

di = max (/3o + /3ia') , d2 = min (/3o + /3i6') , 

A)e[a",fe"],fte[a"',b"'] ^ ^ /3oe[a",6"],/3ie[a"',6"'] ^ ^ 

d\ = min (/3o + 0ia') , do = max (Bq + . 

,3oe[a",fe"]„9iG[a"',6"'] ' /3oe[a",fe"],/3ie[a"',6"'] ^ 

By choice of ci,C2, we can ensure that di < d2. We can choose /io,j/ such that N{0,ho^y) gives 1/3 
probability to (0, — d'l)- Then we compute lower bounds on the convolutions for each part of our 
partitioned space. Let hx < /io,x and hy < h^^y. If (x,y) E [a',&'] x [d\,d'^, 

j-b' rb" j-b'" 

fo,hix,y)> / / fo{fi,/3o + Pi^j.)(j)hAx- fJ-)4>hyiy- /3o- Pi^i)diJ.dPodPi 

Ja' Ja" Ja'" 

> Climb' - x)/hx) + ^x - a')/hx)){H{y - di)/hy) + md2 - y)/hy)) 

> ci/9. 

Now consider {x,y) £ [a',b'] x [d2,oo), 

nb' f-y f-y 



a' J a" J a 



II _ „iii 



> foix, y)mb' - x)/hx) + H{x - a')/hx)){l/2 + $((4 - d[)/hy) - 1) 

> fo{x,y)/9. 

The negative tail and both cases where x ^ [a' , b'],y G [d\, can be bounded in the same manner. 
Now consider (x,y) G [6', 00) x [^2,00), 

i-x r-y ry 

k,h{x,y)> III /o(/",/3o + /3i/")</'fc,(2;-^)0/iy(y-/3o-/3i^)c?/"d/3oc?/3i 



a' J a" J a'" 



> /o(x,y)(l/2 + - a')/hx) - l)(l/2 + - d[)/hy) - 1) 

> fo{x,y)/9. 
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We can do this for the other three regions to generate the function, 



log(3/o(x, y)/ci) if X, 2/ G [a', b'] x [d[,d'^] 
log(9) otherwise. 



The function g{x,y) dominates log(/o//o,/i) and is P/p integrable. Using dominated convergence 
and a variant of Fatou's lemma, f /o log(/o//o,/i) sts hx, hy ^ 0. □ 

Theorem A-5. Define fh,p{x,y) as 

fh,p{x,y) = j (f)h,{x - li)4>hy{y - /3o - Pix)dP{ix,Po,Pi). 

Suppose that fo{x, y) has compact support and all values of (/x, (3q, hx, hy) are in the weak support 
of n. Then for all e > 0, 



/o(x, y) log !^^^/^\ dxdy < e j > 0. 
fh,pix,y) J 



Proof. This proof is similar to the one presented in Ghosal et al. (1999). Consider the case where 



X is scalar; this assumed for simplicity and can easily be extended to the vector-valued case. Fix 
e > 0. Assume fo,h{x,y) is as above. Note that 



/ 



fo{x,y)loi 



foix,y) 
fh,p{x,y) 



fo{x,y) loi 



fo{x,y) , f r t M f0,h{^,y) (f, „x 

+ / fo{x,y)\og- — -. A-2 

J fh,p{x,y) 



fo,h{x,y) 



Using Theorem A-4, we can pick /io,a;, h^^y such that for every hx < ho^x, hy < /io,j/, 



. / M foix,y) ^ /„ 
fo[x, y) log — — < e/2. 

foMx,y) 



(A-3) 



Now we focus on the second term of Equation (A-2). Set 

V=[-x',x'] X [-y',y'] x [-y',y' 

Note that 

fo,h{x,y) 



fo{x,y)log 



fh,p{x,y) 



< 



Define 



J, /p/o(m,/3o + (3in)4>h^ix - i^)4>hy{y - (3o- f3ix)diJ.df3od(3 

x'j-y ' Iv^t^hAx - l^)4>hyiy - /3o - Pix)dP{iJ,,Po,Pi) 



dxdy. 



^h.^ [x - [i)4>hy {y- Po- Pix) : x G [-x, x'],y e [-y' , y'] } 



is a set of functions in (/i, /3o, /3i) G P; it is uniformly equicontinuous. Given this fact, we can use 
the Arzela-Ascoli theorem to find a finite tiling of {x,y) to approximate goix,y) arbitrarily well in 



the region x G [—x',x'], y G [—y',y']. The rest of the proof proceeds as in Ghosal et al. (1999). □ 



Theorem A-5 can easily be extended to include scale mixtures of CxjCTy (^Ghosal et al. 1999 



Tokdar|[2006l ). Theorem [5^ follows direct result of Theorem I A- 5 



All of the base measures proposed in Subse ctio n |5.2| have weak support over the desired space, 
, and hence satisfy Theorem 5.2 
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A-4 Proof of Theorem 15.31 and Results for Base Measures 



We follow the same plan for sketching the proof of Theorem 5.3 



Theorem A-6. Suppose that fo{x,k) has compact support. Define 



k,h{x,k)= I fo{fi,k)(f)hix - fi)du. 



Then, 



lim 



fo{x, k) log /°^^' ^\ dxdh = 0. 



The proof is similar to that of Theorem A-4 
Theorem A-7. Define fh,p{x,k) as 

f f i\ [ A ( ^ exp(/3o,fc - ^ ^ ^ 

Jh,p{x,k)= I (l)h{x-n)—j^ — -dP{fi,(3o^i,. . . ,(3i^k)- 

Lj=i exp(/3o,i + Pi,jX) 

Suppose that fo{x,k) has compact support and all values of {n, Po^i, . . . , f3i^K,h) are in the weak 
support o/n. Then for all e > 0, 



n |p : / fo{x, k) log -P^^^dxdk < el > 0. 



Proof. Assume fo^h{x,k) is as above. Note that 



/o(x,fc)log 



fo{x,k) 
f{x,k) 



/o(x, A;)log 



fo{x,k) 
fo,h{x,k) 



+ / /o(2;,A;)lo^ 



fo,h{x, k) 
fh,p{x,k)' 



Using Theorem A-4, we can pick ho such that for every h < /io,x, 

f f ,M fo{x,k) 
/o(x, k) log — - — - < e/4. 

fo,h{^^'^) 



(A-4) 



(A-5) 



Now we focus on the second term of Equation (A-4). We can make /o,/i(x,A;) arbitrarily close 
to a density with bounded response probabilities that are still continuous in x. Choose 7 > 0; 
for the sake of simplicity, assume k = 1,2 (the outcome space can easily be expanded). Writing 

fo{x,k) = fo{x)fo{k\x), set 

fo{k\x) if /o(A;|x)G [7,1-7], 
f^{k\x) = { 7 if /o(fc|x)G [0,7), 

1-7 if /o(A;|x)G (1-7,1]. 



Then, for any x € [—x',x'] and k € {1,2}, 

/ fo{lJ')fo{k\l^)4>h{x - fi)d^i 



I foifJ')fj{k\lJ')4'hix - fi)dfj. 



< 27. 



Note that 



fo{x,k)\og 



fo,h{x,k) 
fh,p{x,k) 



fo{x,k)lo^ 



fo,hix,k) 

fh,-y{x, k) 



+ / /o(a:,A;)log|^ 

J Jh 



fh.'fix, k) 

(x,k)' 



< I /o(x,fc)log f-^f'^,U 27. 

fh,p{x,k) 
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Finally, we approximate by something that has the desired GLM form. Fix < 6 < j/lAK). 
Define 

/ K K 
9s{yi, ■ ■ ■,yk\x) = max I 0,5^/^(A;|x) - ^ 

V fc=i k=l 



Eliexp(yj) 



fj{k\xj 



for y € [—6, b] for some bound b. This produces a triangular box of "close" conditional probabilities 
around the true conditional probability. Define fh,-y,s as 



fhn,s{x,k) 



/o(/u) 



bo]^ J[-biMV^ Il-bo,bo]'< /[-bi,6iF^ 5"5(^o + /^iAi> k\fj,)d/3odl3i 



(A-6) 



^ ^ exp(/3o,fc + /3i,fcM) , ,o 
Li=i exp(/3o,j + Pijfi) 

Because f^{k\x) is bounded away from and 1, we can find finite 6o, bi such that 

max gsiPo,! + Pi,ix, (3o^K + Pi,kx\x) = 6 

/3oe[-feo,M^./3ie[-bi,bi]-^ 

for every x £ [— x', x']. Let L be the maximum Lipschitz constant of fo{k\x) for x G [— x', x']; these 
exist because /o(/c|x) is continuous in x, which has a compact domain. Then, observe that 



/ h{lj)fiik\lj)(t>h{x - n)dn _ ^ 



< 2KL6. 



Therefore, 



/o(x,fc)lo^ 



fh,'-f{x, k) 

fh,p{x,k) 



fo{x,k)log 



fh,~f{x, k) 



fh,'y,5{x 



^ + y" /o(x,A:)log 



fh,'y,s{x,k) 

fh,p{x,k) ' 



Set 



/o(x,A;)log-— ^- — — 
fh,p{x,k) 

exp(/3o,A; + /3i,fex) 



+ < 2KL6. 



:F=\Mx-f^) ' "^'7^ , : X G i-x', x'] , A; G {1, . . . , . 

[ Lj=i exp(/3o,j + Pijx) J 

It is equicontinuous for (/x,/3o,/3i) G [— x',x'] x [—bo,bo]^ x [— using this fact, we proceed 
in the same manner as in Theorem IA-51 

□ 



Theorem 5.3 follows from Theorem A- 7 As in the Gaussian example, all of the base measures 
given have sufficient weak support. 



A-5 CMB Computational Details 

The DP-GLM was run on the largest data size tested several times; log posterior probabilities were 
evaluated graphically, and in each case the posterior probabilities seem to have stabilized well before 
1,000 iterations. Therefore, all runs for each sample size were given a 1,000 iteration burn-in with 
samples taken every 5 iterations until 2,000 iterations had been observed. The scaling parameter 
a was given a Gamma prior with an initial value set at 1. The means and variances of each 
component and all GLM parameters were also given a log-normal hyper distribution. The model 
was most sensitive to the hyper-distribution on ay, the GLM variance. Small values were used 
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(log(my) ~ A^(— 3,2)) to place greater emphasis on response fit. The non-conjugate parameters 
were updated using the Hamiltonian dynamics method of Neal (2010). 

Models with both conjugate and log-normal base measures were tried on this dataset. The 
conjugate base measure model did not capture heteroscedasticity well, so only the log-normal base 
measure was used for the remaining tests on this dataset. 

This model was implemented in Matlab; a run on the largest dataset took about 500 seconds. 



A-6 CCS Computational Details 

Again, the DP-GLM was run on the largest data size tested several times; log posterior probabilities 
were evaluated graphically, and in each case the posterior probabilities seem to have stabilized well 
before 1,000 iterations. Therefore, all runs for each sample size were given a 1,000 iteration burn- 
in with samples taken every 5 iterations until 2,000 iterations had been observed. The scaling 
parameter a was given a Gamma prior with an initial value set at 1. The hyperparameters of the 
conjugate base measure were set manually by trying different settings over four orders of magnitude 
for each parameter. 

All base measures were conjugate, so the sampler was fully collapsed, a was updated using 



Hamiltonian dynamics (Neal 2010). Original results were generated by Matlab; the longest run 
times were about 1000 seconds. This method has been re-implemented in Java in a highly efficient 
manner; the longest run times are now under about 10 seconds. Run times would likely be even 



faster if variational methods were used for posterior sampling (Blei and Jordan 2005). 



A-7 Solar Computational Details 

Again, the DP-GLM was run on the largest dataset size tested several times; log posterior probabil- 
ities were evaluated graphically, and in each case the posterior probabilities seem to have stabilized 
well before 1,000 iterations. Therefore, all runs for each sample size were given a 1,000 iteration 
burn-in with samples taken every 5 iterations until 2,000 iterations had been observed. The scaling 
parameter a was set to 1 and the Dirichlet priors to Dir{l, 1, . . . , 1). The response parameters 
were given a Gaussian base distribution with a mean set to and a variance chosen after trying 
parameters with four orders of magnitude. 

All covariate base measures were conjugate and the (3 base measure was Gaussian, so the sampler 
was collapsed along the covariate dimensions and used in the auxiliary component setting of Algo- 



rithm 8 of Neal (2000). The (3 parameters were updated using Metropolis-Hastings. Results were 
in generated by Matlab; run times were substantially faster than the other methods implemented 
in Matlab (under 200 seconds). 
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